# High Speed RNS MAC Unit Using Prefix Topology for Low Complexity DSP Applications

## <sup>1</sup>M.Saranya, <sup>2</sup>Mr.A.Vijayabrabhu,

<sup>1</sup>M.E, VLSI Design, <sup>2</sup>Assistant Professor, Dept. of Electronics and Communication Engineering, Sri Venkateswara College of Engineering and Technology, Tiruvallur.

**Abstract:** In this paper we present, for the first time, prefix-topology-based accumulation units with variable latency that link equations via parallel-prefix computation, using various adder methodologies such as the ripple carry adder, Kogge-Stone adder, Brent-Kung adder, Ladner-Fischer adder and Han-Carlson adder. This work also enables the design of a high-speed and unique MAC hardware structure using a Vedic multiplier, making it suitable for DSP applications. To prove the hardware efficiency of the Wallace tree multiplier unit, it is compared with state-of-the-art methods such as the high-speed Vedic multiplier and the shift-and-add based DA method. Moreover, this methodology has several attractive features such as simplicity, regularity and modularity of architecture. In addition, a Chinese Remainder Theorem based Residue Number System (RNS) is incorporated to meet the area constraints and high-speed requirements of FIR filter design, where all bits of one tap unit are processed within a bounded delay. To reduce the complexity of the filter, coefficients are represented in canonical signed digit (CSD) representation, as it is more efficient than the traditional binary representation. The comparative analysis is carried out using a 20th-order FIR filter. Hardware optimization in terms of area, delay and power of the different prefix techniques, the Vedic multiplier, the add-and-shift method and the Wallace tree (WT) multiplier is analyzed using FPGA hardware synthesis. Finally, the performance efficiency of the FIR filter design is demonstrated using an ECG signal de-noising application.

Keywords: prefix adder, RNS, FIR design, DSP etc.

#### I. Introduction

High-speed MAC units are indispensable in many real-time applications such as filter design, image processing, data acquisition and control [1]–[2]. In particular, Digital Signal Processing (DSP) requires custom accelerators to perform computationally intensive arithmetic. Typical DSP applications need to carry out a large number of MAC operations, as their implementation is based on multiplier kernels. As expected, the overall performance of DSP systems is considerably affected by both hardware complexity and latency. Many existing works on arithmetic optimization models [3] have concluded that the design requires high-speed operation together with a significant reduction in complexity.

Although the multiplier is the major hardware and power-hungry digital block in most signal and image processing systems such as FIR filters, digital signal processors and microprocessors, its metrics also depend on the accumulation unit. With advances in prefix topology, many researchers have tried to improve the efficiency of MAC design, offering one of the following: high speed, low power consumption or considerably less hardware, thus making the designs compatible with various high-speed, low-power and compact VLSI implementations. However, area and speed are conflicting constraints, so improving speed typically results in larger area. The most efficient multiplier structure varies depending on the throughput requirement of the application. The first step of the design process is the selection of the optimum circuit structure. There are various structures for performing multiplication, ranging from simple serial multipliers to complex parallel multipliers.

International Conference On Progressive Research In Applied Sciences, Engineering And Technology (ICPRASET 2K18)

Optimization of the hardware complexity, speed and energy efficiency of digital MAC and its analogue DSP blocks becomes a challenging task as the number of digital components increases [4, 5]. One such application is designing a MAC for an FIR filter, where the worst-case delay propagation becomes more severe as design complexity grows with filter order [6]. Many previous works [7, 8] have focused on hardware-efficient implementation of FIR filters using various adders and multipliers, and also using DA models such as canonical signed digit representation of the filter coefficients. Similarly, a hardware-sharing multiplication approach [9], which combines add and shift operations over common computation results, has also been implemented earlier. However, the major problem with this sharing multiplication method is that, as the number of bits used to represent the filter coefficients increases significantly, a large additional memory area is needed for computation sharing. Therefore, in the present work we prefer multiplication using the add-and-shift method with canonical signed digit (CSD) representation.
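As an illustration of the add-and-shift multiplication with CSD coefficients preferred above, the following minimal Python sketch recodes an integer coefficient into CSD digits and multiplies by shifts and adds or subtracts only. The function names `to_csd` and `csd_multiply` are hypothetical and are not taken from the paper; the sketch assumes integer (already quantized) coefficients.

```python
def to_csd(value: int) -> list[int]:
    """Recode an integer into CSD digits (-1, 0, +1), least significant first.

    Hypothetical helper: no two adjacent digits are non-zero, which is the
    defining property of the canonical signed digit form.
    """
    digits = []
    while value != 0:
        if value % 2 == 0:
            digits.append(0)
        else:
            # Pick +1 or -1 so that the remainder becomes divisible by 4,
            # which forces the next digit to be 0.
            d = 2 - (value % 4)  # value % 4 == 1 -> +1, value % 4 == 3 -> -1
            digits.append(d)
            value -= d
        value //= 2
    return digits


def csd_multiply(x: int, coeff: int) -> int:
    """Multiply x by coeff using only shifts, additions and subtractions."""
    acc = 0
    for shift, d in enumerate(to_csd(coeff)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc
```

For example, the coefficient 7 (binary 111, three non-zero bits) is recoded as 8 - 1, so one subtraction replaces two additions.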

#### II. MAC Unit

#### 2.1 Prefix Algorithms

Two categories of prefix algorithms can be distinguished: serial-prefix and tree-prefix algorithms. Tree-prefix algorithms use parallelism to speed up the calculation and therefore form the category of parallel-prefix algorithms. Equation 3.21 represents a serial algorithm for solving the prefix problem. The serial-prefix algorithm needs a minimal number of binary $\bullet$ operations but is inherently slow ($O(n)$).

According to equation 3, all outputs can be computed separately and in parallel. By arranging the operations in a tree structure, the computation time for each output can be reduced to $O(\log n)$. However, the overall number of operations to be evaluated, and with it the hardware cost, grows as $O(n^2)$ if individual evaluation trees are used for each output.

As a trade-off, the individual output evaluation trees can be merged (i.e., common sub-expressions can be shared) to a certain degree according to different tree-prefix algorithms, reducing the area complexity to $O(n \log n)$ or even $O(n)$. Binary addition can be expressed as such a prefix computation: the prefix problem of binary carry-propagate addition computes the generation and propagation of carry signals. The corresponding bit-generate $g_i$ and bit-propagate $p_i$ signals have to be computed from the addition input operands in a pre-processing step.
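The prefix formulation of addition can be sketched in software. The Python model below follows the Kogge-Stone arrangement (one of the topologies named in the abstract) as a bit-level behavioural illustration, not as the authors' hardware description. The function name `kogge_stone_add` and the default `width` of 16 bits are assumptions made for this example.

```python
def kogge_stone_add(a: int, b: int, width: int = 16) -> int:
    """Behavioural model of a Kogge-Stone parallel-prefix adder (sketch)."""
    # Pre-processing: bit generate g_i = a_i AND b_i, bit propagate p_i = a_i XOR b_i.
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]

    # Prefix tree: about log2(width) levels; level k combines groups 2**k apart
    # with the prefix operator (G, P) . (G', P') = (G | (P & G'), P & P').
    G, P = g[:], p[:]
    dist = 1
    while dist < width:
        new_G, new_P = G[:], P[:]
        for i in range(dist, width):
            new_G[i] = G[i] | (P[i] & G[i - dist])
            new_P[i] = P[i] & P[i - dist]
        G, P = new_G, new_P
        dist *= 2

    # Post-processing: sum bit s_i = p_i XOR c_i, where c_0 = 0 and c_i = G[i-1].
    result = 0
    for i in range(width):
        carry = 0 if i == 0 else G[i - 1]
        result |= (p[i] ^ carry) << i
    result |= G[width - 1] << width  # carry-out as the extra top bit
    return result
```

Within each level of the `while` loop all updates are independent of one another, which is what a hardware implementation exploits to reach logarithmic depth. A serial-prefix (ripple) version would instead compute each carry from the previous one in a single chain of length `width`.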

#### III. RNS System

#### 3.1 Chinese Remainder Theorem

Given a set of pairwise relatively prime moduli $\{m_1, m_2, \ldots, m_n\}$ and a residue representation $\{r_1, r_2, \ldots, r_n\}$ in that system of some number $X$, i.e. $r_i = |X|_{m_i}$, that number and its residues are related by the equation:

$$|X|_M = \left| \sum_{i=1}^{n} r_i \left| M_i^{-1} \right|_{m_i} M_i \right|_M \tag{1}$$

where $M$ is the product of the $m_i$'s and $M_i = M/m_i$. If the values involved are constrained so that the final value of $X$ is within the dynamic range, then the modular reduction on the left-hand side can be omitted.

To understand the formulation of Equation (1), we rewrite $X$ as:

$$\begin{aligned} X &\triangleq \{r_1, r_2, \ldots, r_n\} \\ &\triangleq \{r_1, 0, \ldots, 0\} + \{0, r_2, \ldots, 0\} + \cdots + \{0, 0, \ldots, r_n\} \\ &\triangleq X_1 + X_2 + \cdots + X_n \end{aligned} \tag{2}$$

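Hence, the reverse conversion process requires finding the $X_i$'s. The operation of obtaining each $X_i$ is a reverse conversion process by itself. However, it is much easier than obtaining $X$.

Consider now that we want to obtain $X_i$ from $\{0, 0, \ldots, r_i, \ldots, 0, 0\}$. Since the residues of $X_i$ are all zero except for $r_i$, $X_i$ is a multiple of $m_j$ for every $j \neq i$. Therefore, $X_i$ can be expressed as:

$$X_i = r_i \tilde{X}_i \tag{3}$$

where $\tilde{X}_i$ denotes the number whose residue is 1 with respect to $m_i$ and 0 with respect to every other modulus. We define $M_i$ as $M/m_i$, where $M = \prod_{i=1}^{n} m_i$. Then:

$$\left| \left| M_i^{-1} \right|_{m_i} M_i \right|_{m_i} = 1 \tag{4}$$

Since all the $m_i$'s are relatively prime, the inverses exist:

$$\tilde{X}_i = \left| M_i^{-1} \right|_{m_i} M_i$$

$$X_i = r_i \tilde{X}_i = r_i \left| M_i^{-1} \right|_{m_i} M_i$$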

To ensure that the final value is within the dynamic range, modulo reduction has to be added to both sides of the equation.
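The reverse conversion described in this section can be checked numerically. The following Python sketch sums the terms $r_i \, |M_i^{-1}|_{m_i} \, M_i$ and applies the final reduction modulo $M$. The function names `to_residues` and `crt_reverse` are hypothetical, the moduli in the usage note are chosen only for illustration, and the modular inverse relies on the three-argument `pow` available in Python 3.8 and later.

```python
from math import prod


def to_residues(x: int, moduli: tuple[int, ...]) -> list[int]:
    """Forward conversion: r_i = |X|_{m_i} for each modulus."""
    return [x % m for m in moduli]


def crt_reverse(residues: list[int], moduli: tuple[int, ...]) -> int:
    """Reverse conversion via the Chinese Remainder Theorem (sketch).

    Assumes the moduli are pairwise relatively prime, so every
    multiplicative inverse |M_i^{-1}|_{m_i} exists.
    """
    M = prod(moduli)                   # dynamic range
    total = 0
    for r_i, m_i in zip(residues, moduli):
        M_i = M // m_i                 # M_i = M / m_i
        inv = pow(M_i, -1, m_i)        # |M_i^{-1}|_{m_i}
        total += r_i * inv * M_i       # X_i = r_i * |M_i^{-1}|_{m_i} * M_i
    return total % M                   # final modulo-M reduction
```

With the illustrative moduli (3, 5, 7), the number 52 has residues [1, 2, 3]; the three terms are 70, 42 and 45, and their sum 157 reduced modulo 105 gives 52 again.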



Fig. 1: Block diagram of a reverse conversion

#### IV. FIR Filter Based on RNS

Filter realizations using the Residue Number System (RNS) are suitable for high-speed digital signal processing due to their inherent parallelism, modularity, fault tolerance and local carry propagation properties. Arithmetic operations like multiplication and addition can be carried out more efficiently because RNS keeps carry propagation local to each residue channel. RNS is particularly suitable for implementing FIR filters, where multiplications and additions are the core operations. These features make RNS beneficial for digital signal processing applications, particularly when large word lengths and high throughput rates are required.



Fig. 2: FIR Filter using RNS system

Here the multiplier was replaced by the RNS computation unit and addition was performed by the parallel prefix adder. This reduces the complexity of the system and increases the speed of operation.
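A behavioural sketch of such an RNS-based FIR filter is given below in Python. It is an illustration under stated assumptions rather than the authors' architecture: the moduli set (31, 32, 33), the signed mapping of the upper half of the dynamic range to negative values, and the names `forward`, `reverse` and `rns_fir` are all hypothetical. Ordinary integer addition stands in for the parallel prefix adder.

```python
from math import prod

MODULI = (31, 32, 33)   # assumed pairwise coprime set of the form {2^n - 1, 2^n, 2^n + 1}
M = prod(MODULI)        # dynamic range, 32736 for this set


def forward(x: int) -> tuple[int, ...]:
    """Binary-to-RNS conversion; Python's % already yields non-negative residues."""
    return tuple(x % m for m in MODULI)


def reverse(residues) -> int:
    """CRT reverse conversion, mapping the upper half of the range to negatives."""
    total = sum(r * pow(M // m, -1, m) * (M // m) for r, m in zip(residues, MODULI)) % M
    return total - M if total >= M // 2 else total


def rns_fir(samples: list[int], coeffs: list[int]) -> list[int]:
    """Direct-form FIR filter whose multiply-accumulate runs per residue channel."""
    coeff_res = [forward(c) for c in coeffs]
    history = [0] * len(coeffs)            # delay line, newest sample first
    out = []
    for x in samples:
        history = [x] + history[:-1]
        acc = [0] * len(MODULI)
        for h, c_res in zip(history, coeff_res):
            h_res = forward(h)
            # Each channel works on small residues only; no carries cross channels.
            acc = [(a + hr * cr) % m for a, hr, cr, m in zip(acc, h_res, c_res, MODULI)]
        out.append(reverse(acc))
    return out
```

The result is only correct while every true filter output stays inside the signed dynamic range of the chosen moduli, here from -16368 to 16367. For instance, `rns_fir([3, -1, 4, 1, -5, 9], [2, -3, 1, 4])` agrees with direct convolution of the same integers.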

Fig. 3: Filter coefficient extraction using FDA tool

#### 4.1 FIR De-noising Application

Here, low-pass FIR filter coefficients are generated using the FDA (filter design and analysis) tool and preprocessed in MATLAB. An ECG signal is also generated and converted into 2's complement representation, and these digital values are stored as a text file for block RAM. The memory unit basically consists of two parts: 1) the pre-processed FIR coefficients; 2) the noisy ECG signal component.
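The conversion of samples into 2's complement words can be sketched as follows. This Python example is only an assumption-laden illustration: the 8-bit word width, the scaling of samples from the range [-1, 1], the one-word-per-line text layout and the function names `to_twos_complement` and `quantize` are not specified in the paper.

```python
def to_twos_complement(value: int, width: int = 8) -> str:
    """Return the 2's complement bit string of a signed integer."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    if not lo <= value <= hi:
        raise ValueError(f"{value} does not fit in {width} bits")
    # Masking maps negative values onto their 2's complement bit pattern.
    return format(value & ((1 << width) - 1), f"0{width}b")


def quantize(samples: list[float], width: int = 8) -> list[str]:
    """Scale samples assumed to lie in [-1, 1] and encode them as bit strings."""
    scale = (1 << (width - 1)) - 1
    return [to_twos_complement(round(s * scale), width) for s in samples]


# One word per line, a layout assumed here for a memory initialization text file.
memory_text = "\n".join(quantize([0.0, 1.0, -1.0]))
```

The same routine could be applied to both the filter coefficients and the noisy ECG samples before they are written to the text file.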



Fig. 5: FIR filter simulation for ECG de-noising.

#### V. Experimental Results

Here we compare the performance and the speed versus area trade-off of various MAC units using compound prefix adders against existing well-known benchmark multiplication schemes; the results are reported in Tables 1 and 2 for the schemes described in Section 1. We extended this analysis to FIR filter implementation schemes and applied them to a DSP application, ECG signal de-noising. Hardware synthesis was carried out to support the core objective of this work, reporting the performance of the aforementioned designs, their metrics under architectural-level modifications, and the highest achievable complexity reduction and frequency.

Table 1 Trade off measures of prefix methodologies

| Adder type               | AREA | Fmax report |
|--------------------------|------|-------------|
| FIR MUL - Brent Kung     | 426  | 244.32 MHz  |
| FIR MUL - Han Carlson    | 243  | 116.01 MHz  |
| FIR MUL - Kogge stone    | 748  | 234.3 MHz   |
| FIR MUL - Ladner Fischer | 438  | 310.95 MHz  |

Table 2 Trade off measures of various MAC units

| Multiplier type        | AREA | Fmax report |
|------------------------|------|-------------|
| FIR MUL - Wallace tree | 1591 | 268.46 MHz  |
| FIR MUL - Vedic        | 664  | 83.13 MHz   |
| FIR MUL - shift/add    | 859  | 307.03 MHz  |
| FIR MUL - CSD          | 601  | 131.98 MHz  |

Here, the FIR MAC designs are synthesized for an Altera Cyclone III FPGA using the Quartus II tool. Our Booth-recoded design consumes 28.85% and 39.40% less than the existing methods. Our design also has a critical path delay of 1.35 ns, which is several times faster than all other existing MAC types. The efficiency could remain consistent for operands of nominally higher bit width. As it stands, the performance metrics of the proposed bit-serial shift-add based MAC network make it a preferred choice for DSP applications.

#### VI. Conclusion

The main aim of this work was to design an RNS-based high-speed reverse converter. The results and key outcomes show that delay and power consumption were reduced in the proposed system using a parallel prefix adder. The system was applied to FIR filter design, where it showed improved speed and lower power consumption, supporting the efficiency of the proposed designs. The proposed model, which combines these adder techniques, was extended to carry out FIR filtering in an ECG signal de-noising application. The effectiveness of our improved RNS-coded MAC was verified using an FIR implementation unit. Finally, the complete trade-off metrics of the RNS multiplier unit with a prefix-topology-based accumulation unit were validated using hardware synthesis.

#### References

- [1] S. Ravi Chandra Kishore, K.V. Ramana Rao, "Implementation Of Carry-Save Adders In FPGA", IJEAT, ISSN: 2249-8958, Volume-1, Issue-6, August 2012.
- [2] Ms. V.N. Chaudhary, Prof. Dr. P.R. Deshmukh, "Analysis And Implementation Of Low Power Wallace Tree Multiplier", JIKRECE, Nov 10 To Oct 11, Volume – 01, Issue-02.
- [3] Simran Kaur And Mr. Mansul Bansar, "FPGA Implementation Of Efficient Modified Booth Wallace Multiplier", Thaipur University, Pune, June 2011.
- [4] J.-L. Beuchat And J.-M. Muller, "Automatic Generation Of Modular Multipliers For FPGA Applications," IEEE Transactions On Computers, Vol. 57, No. 12, Pp. 1600–1613, December 2008.
- [5] Nandi, A., Saxena, A.K., Dasgupta, S.: 'Enhancing Low Temperature Analog Performance Of Underlap FinFET At Scaled Gate Lengths', IEEE Trans. Electron Devices, 2014, 61, (11), Pp. 3619–3624.
- [6] Nandi, A., Saxena, A.K., Dasgupta, S.: 'Analytical Modeling Of Double Gate MOSFET Considering Source/Drain Lateral Gaussian Doping Profile', IEEE Trans. Electron Devices, 2013, 60, (11), Pp. 3705–3709.
- [7] Wu, L., Cui, Y., Huang, J.: 'Design And Implementation Of An Optimized FIR Filter For IF GPS Signal Simulator'. IEEE Conf. On Microelectronics And Electronics, September 2010, Pp. 25–28.
- [8] Lim, Y.C., Parker, S.R.: 'FIR Filter Design Over A Discrete Power-Of-Two Coefficient Space', IEEE Trans. Acoust. Speech Signal Process., 1983, ASSP-31, Pp. 583–591.
- [9] Samueli, H.: 'An Improved Search Algorithm For The Design Of Multiplierless FIR Filter With Powers-Of-Two Coefficients', IEEE Trans. Circuits Syst., 1989, 36, Pp. 1044–1047.
- [10] Park, J., Muhammad, K., Roy, K.: 'High-Performance FIR Filter Design Based On Sharing Multiplication', IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2003, 11, (2), Pp. 244–253.
- [11] Sunil, M., Ankith, R.D., Manjunatha, G.D., Et Al.: 'Design And Implementation Of Faster Parallel Prefix Kogge Stone Adder', Int. J. Electr. Electron. Eng. Telecommun., 2014, 3, (1), Pp. 116–121.
- [12] Yezerla, S.K., Naik, B.R.: 'Design And Estimation Of Delay, Power And Area For Parallel Prefix Adders'. Int. Conf. On Recent Advances In Engineering And Computational Sciences, 2014.
- [13] Gedam, S.K., Zode, P.P.: 'Parallel Prefix Han-Carlson Adder', Int. J. Res. Eng. Appl. Sci., 2014, 2, (2), Pp. 81–84.
- [14] Satish, C., Arur, P.C., Kumar, G.K.: 'An Efficient High Speed Wallace Tree Multiplier', Int. J. Emerging Trends Electr. Electron., 2014, 10, (4), Pp. 38–42.
- [15] Sankar, D.R., Ali, S.A.: 'Design Of Wallace Tree Multiplier By Sklansky Adder', Int. J. Eng. Res. Appl., 2013, 3, (1), Pp. 1036–1040.